
(CVPR 2018) Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network

Zhang Z, Xie Y, Yang L. Photographic text-to-image synthesis with a hierarchically-nested adversarial network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6199-6208.



1. Overview


1.1. Motivation

  • a fully end-to-end mapping from the low-dimensional text space to the high-resolution image space remains unsolved
  • two difficulties
    • balancing the convergence between the generator G and discriminator D
    • stably modeling the huge pixel space of high-resolution images while guaranteeing semantic consistency


This paper proposes HDGAN, an extensible single-stream generator architecture

  • hierarchically-nested discriminators. regularize mid-level representations and assist generator training to capture complex image statistics
  • multi-purpose adversarial loss
  • new visual-semantic similarity measure

  • single-stage training (no multi-stage pipeline)

  • no conditioning on multiple text inputs
  • no additional class-label supervision

1.2. Dataset

  • CUB birds
  • Oxford-102 flowers
  • MSCOCO

1.3. Related Work

1.3.1. Generative Models

  • GAN
  • VAE

1.3.2. Text-to-Image

  • (ICML 2016) GAN
  • (NIPS 2016) GAN what-where network
  • (ICCV 2017) StackGAN
  • (ICCV 2017) joint embedding
  • perceptual loss
  • auxiliary classifier
  • attention-driven

1.3.3. Stability of GAN

  • training techniques
  • regularization using extra knowledge
  • combination of G and D

As the target image resolution increases, the training difficulty increases.

1.3.4. Decompose into Multiple Subtasks

  • LAP-GAN
  • symmetric G and D
  • stage-by-stage



2. Methods




2.1. Hierarchical-nested Adversarial Objective



  • G. hierarchical generator
  • z. noise vector
  • t. sentence embedding from a pre-trained char-CNN-RNN text encoder
  • s. number of scales
  • X_1 ... X_s. generated images at gradually increasing resolutions (the nested objective is sketched below)

  • lower resolutions. learn semantically consistent image structure

  • higher resolutions. render fine-grained details
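
The objective figure did not survive extraction. As a hedged reconstruction (the exact terms and weights in the paper may differ), the generator emits one image per scale and each scale is judged by its own discriminator, giving a nested min-max objective of the standard GAN form:

\[
\{X_1, \dots, X_s\} = G(z, t), \qquad
\min_{G} \max_{D_1,\dots,D_s} \; \sum_{i=1}^{s} \Big( \mathbb{E}_{X_i \sim p_{\mathrm{data}}}\big[\log D_i(X_i)\big] + \mathbb{E}_{\hat{X}_i \sim p_G}\big[\log\big(1 - D_i(\hat{X}_i)\big)\big] \Big)
\]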

2.2. Multi-purpose Adversarial Loss



  • pair loss (single scalar). guarantees global semantic consistency between image and text
  • image loss (R_i x R_i probability map). low-resolution discriminators focus on global structure, high-resolution ones on local image details
  • two types of errors treated as fake for the pair loss
    • real image + mismatched text
    • fake image + conditioned (matched) text

2.2.1. D
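
The loss figure is missing here; below is a hedged sketch of the per-scale discriminator objective consistent with Sec. 2.2 (matched real pairs pushed to 1; unconditional fakes, mismatched-text pairs, and fake-image pairs pushed to 0). The paper's exact formulation may differ:

\[
\mathcal{L}_{D_i} = -\,\mathbb{E}\big[\log D_i(X_i)\big] - \mathbb{E}\big[\log(1 - D_i(\hat{X}_i))\big]
- \mathbb{E}\big[\log D_i(X_i, t)\big] - \mathbb{E}\big[\log(1 - D_i(X_i, \bar{t}))\big] - \mathbb{E}\big[\log(1 - D_i(\hat{X}_i, t))\big]
\]

where X_i is a real image at scale i, \hat{X}_i a generated one, t the matched sentence embedding, and \bar{t} a mismatched one.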



2.2.2. G
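
Likewise, a hedged sketch of the generator objective (the non-saturating form and the weight λ on the KL term from Sec. 2.2.3 are assumptions): the generator tries to make every D_i accept both the image alone and the image-text pair:

\[
\mathcal{L}_{G} = -\sum_{i=1}^{s} \Big( \mathbb{E}_{z,t}\big[\log D_i(\hat{X}_i)\big] + \mathbb{E}_{z,t}\big[\log D_i(\hat{X}_i, t)\big] \Big) + \lambda\, \mathcal{L}_{\mathrm{KL}}
\]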



2.2.3. Conditioning Augmentation

Instead of directly using the deterministic text embedding, a stochastic conditioning vector is sampled from a Gaussian distribution whose mean and covariance are functions of the text embedding



A Kullback-Leibler divergence regularization term (against the standard Gaussian) is added to prevent over-fitting and enforce smoothness of the conditioning manifold
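
A minimal PyTorch-style sketch of conditioning augmentation as described above (class name, dimensions, and layer choice are illustrative assumptions, not the authors' code):

    import torch
    import torch.nn as nn

    class CondAugmentation(nn.Module):
        """Map a deterministic sentence embedding to a stochastic conditioning vector."""
        def __init__(self, text_dim=1024, cond_dim=128):
            super().__init__()
            # one linear layer predicts both the mean and the log-variance of the Gaussian
            self.fc = nn.Linear(text_dim, cond_dim * 2)

        def forward(self, t):
            mu, logvar = self.fc(t).chunk(2, dim=1)
            # reparameterization: c = mu + sigma * eps with eps ~ N(0, I)
            c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            # KL(N(mu, sigma^2) || N(0, I)), averaged over batch and dimensions
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return c, kl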



2.3. Architecture

2.3.1. G

  • three modules (see the sketch below)
    • K-repeat ResBlock. two Conv + BN-ReLU layers per unit
    • stretching layer. x2 nearest-neighbor upsampling + Conv-BN-ReLU
    • linear compression layer. Conv + Tanh, producing the RGB side output at each scale

structure: 1-2-1-2-... (ResBlocks alternating with stretching layers)

  • text embedding from conditioning augmentation (CA). reshaped to 1024 x 4 x 4
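
A rough PyTorch-style sketch of the three generator modules listed above (kernel sizes and the placement of the residual ReLU are assumptions for illustration):

    import torch.nn as nn

    class ResBlock(nn.Module):
        """One unit of the K-repeat ResBlock: two Conv + BN-ReLU layers with a skip connection."""
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(x + self.body(x))

    def stretching_layer(in_ch, out_ch):
        """Doubles spatial resolution: x2 nearest-neighbor upsample + Conv-BN-ReLU."""
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def linear_compression(in_ch):
        """Projects a feature map to an RGB side output at the current scale: Conv + Tanh."""
        return nn.Sequential(nn.Conv2d(in_ch, 3, 3, padding=1), nn.Tanh())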

2.3.2. D

  • stack of stride-2 Conv + BN-LeakyReLU layers
  • two branches follow (see the sketch below)
    • image branch. a fully convolutional head produces the R_i x R_i probability map for the local image loss
    • pair branch. the 512 x 4 x 4 feature map is concatenated with the (dimension-reduced) 128 x 4 x 4 text embedding, followed by a 1x1 Conv and a 4x4 Conv to produce the pair score
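
A hedged PyTorch-style sketch of the two discriminator branches on top of the shared stride-2 Conv backbone (channel counts follow the notes above; names and exact layer ordering are assumptions):

    import torch
    import torch.nn as nn

    class DiscriminatorHeads(nn.Module):
        """Two branches over shared features: a local R_i x R_i map and an image-text pair score."""
        def __init__(self, feat_ch=512, local_ch=256, text_dim=1024, reduced_dim=128):
            super().__init__()
            # image branch: fully convolutional 1x1 head -> R_i x R_i real/fake logits
            self.image_head = nn.Conv2d(local_ch, 1, kernel_size=1)
            # pair branch: reduce the sentence embedding, tile it to 4x4, then 1x1 Conv + 4x4 Conv
            self.reduce_text = nn.Linear(text_dim, reduced_dim)
            self.pair_head = nn.Sequential(
                nn.Conv2d(feat_ch + reduced_dim, feat_ch, kernel_size=1),
                nn.BatchNorm2d(feat_ch), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(feat_ch, 1, kernel_size=4),  # 4x4 conv collapses the 4x4 map to a scalar
            )

        def forward(self, deep_feat, local_feat, t):
            # deep_feat: B x 512 x 4 x 4, local_feat: B x local_ch x R_i x R_i, t: B x text_dim
            image_logits = self.image_head(local_feat)
            txt = self.reduce_text(t).unsqueeze(-1).unsqueeze(-1)
            txt = txt.expand(-1, -1, deep_feat.size(2), deep_feat.size(3))
            pair_logit = self.pair_head(torch.cat([deep_feat, txt], dim=1))
            return image_logits, pair_logit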



3. Experiments


3.1. Metrics

  • Inception Score. requires a pre-trained Inception model (fine-tuned on each dataset used in the paper)
  • Multi-scale Structural Similarity (MS-SSIM). pairwise similarity; a lower score indicates higher diversity of generated images
  • Visual-semantic Similarity. train a visual-semantic embedding model to measure the distance between text and image (see the sketch below)


  • δ. ranking-loss margin, set to 0.2
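
The embedding objective did not survive extraction; a hedged reconstruction of the usual bi-directional ranking loss with margin δ (the paper's exact form may differ), where c(·,·) is cosine similarity between the embedded image v and text t, and v^-, t^- denote mismatched samples:

\[
\mathcal{L} = \sum \max\big(0,\; \delta - c(v, t) + c(v, t^{-})\big) + \sum \max\big(0,\; \delta - c(v, t) + c(v^{-}, t)\big)
\]

The visual-semantic similarity between a generated image and its conditioning text can then be read off as c(v, t) in this learned embedding.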

3.2. Comparison




  • better preserves semantically consistent information at all resolutions


3.3. Style Transfer



3.4. Ablation Study





  • the local image loss improves performance and lets the pair loss focus more on learning semantic consistency

3.5. Sharing Top Layers of Discriminators

  • no benefits were observed